In this project I will be practising how to work with R. It is part of the Open Data Science Course. Check out my github repository.
In this exercise I have been working on a data set about students' learning motivation. In preparation I transformed the raw data into a data set suitable for analysis; that process is called data wrangling. After that I analysed the data statistically in order to investigate the relationships between certain variables.
learning2014 <- read.csv("/Users/eva-mariaroth/Documents/GitHub/IODS-project/learning2014.csv", header = TRUE)
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender Age attitude deep stra surf Points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
The data set consists of 166 observations of 7 variables (columns) related to the learning behaviour of students. The variables are:
gender : character values, M and F, for male and female
Age : age of the students, numeric data type
attitude : global attitude towards statistics, numeric data type, Likert scale 1-5
deep : multiple questions combined that explore deep learning, numeric data type, Likert scale 1-5
stra : multiple questions combined that explore strategic learning, numeric data type, Likert scale 1-5
surf : multiple questions combined that explore surface learning, numeric data type, Likert scale 1-5
Points : exam points of the students, numeric data type
Installing and loading the packages ggplot2, which is used to create graphics, and GGally, an extension to ggplot2.
install.packages("ggplot2")
install.packages("GGally")
library("ggplot2")
library("GGally")
Exploring the data by printing out a summary:
summary(learning2014)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Now we can get an overview of our data. There were 110 women and 56 men questioned for the data set, with an average age of about 25.5 years, the youngest being 17 and the oldest 55 years old. The median age is 22, and most of the students are between 21 and 27 years old. The average attitude towards statistics is 3.14. The format of the usual Likert scale is:
1 Strongly disagree
2 Disagree
3 Neither agree nor disagree
4 Agree
5 Strongly agree
The mean of the questions considering deep learning is about 3.68, the one considering strategic learning is about 3.12, and the mean of the questions considering surface learning is about 2.79. The average number of points is 22.72, with the minimum being 7 points and the maximum 33.
Exploring the relationships between the variables:
pairs(learning2014[-1], col = learning2014$gender)
The red circles represent the male and the black circles the female participants. In some of these scatter plots a correlation can already be suspected; for example, the number of points seems to increase as the attitude towards statistics increases. A more detailed overview of the variables, their distributions and their relationships can be obtained using ggplot2 and GGally.
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
In this plot matrix I can, for example, see how strongly two variables are correlated. There is a correlation between attitude and points with a correlation coefficient of 0.451. There is little or no correlation between points and deep learning, which I can read from the small correlation coefficient of 0.01.
If we want to examine the strength of the relationship between several explanatory variables and one target (dependent) variable, we can do this with a regression model. The model is written with the formula y ~ x1 + x2 + x3.
Alpha corresponds to the point where the regression line intercepts the y-axis, and beta corresponds to the slope of the regression line. The model can also be used to make predictions. It is assumed that the relationship between y and the explanatory variables is linear. Our target variable is “Points”. As explanatory variables I chose the ones with the highest correlation coefficients in relation to “Points”: attitude, strategic learning and surface learning. Hence our model looks like this:
my_model <- lm(Points ~ attitude + stra + surf, data=learning2014)
The summary can be printed with
summary(my_model)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
It shows the formula that created the model and, underneath, a five-number summary of the residuals of the model. In the table, the first column “Estimate” gives the estimated effect (β parameter) of each explanatory variable on the dependent variable: 3.3952 for attitude, 0.8531 for strategic learning and -0.5861 for surface learning. The next column gives the standard error, followed by the t-value and the p-value of the corresponding statistical test. A very low p-value indicates a statistically significant relationship between the dependent and the explanatory variable. In our case there is no statistically significant relationship between “Points” and strategic or surface learning, so we rewrite the model including only attitude.
my_model2 <- lm(Points ~ attitude, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The regression line of “Points” on “attitude” intercepts the y-axis at 11.64 and has a slope of about 3.5.
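As mentioned earlier, the model can also be used to make predictions. A minimal sketch, plugging hypothetical attitude values into the fitted coefficients (values copied from the summary above); in practice `predict(my_model2, newdata = ...)` does the same:

```r
# Hedged sketch: predicted exam points from the fitted line
# Points is approximately intercept + slope * attitude (coefficients above)
intercept <- 11.6372
slope <- 3.5255
new_attitude <- c(2, 3, 4, 5)   # hypothetical attitude scores
predicted_points <- intercept + slope * new_attitude
round(predicted_points, 1)      # 18.7 22.2 25.7 29.3
```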
plot(my_model2, which = 1)
In the linear regression model it is assumed that the errors are normally distributed and that the size of an error does not depend on the explanatory variables. This assumption can be explored with a scatter plot of residuals versus model predictions: if the points are spread randomly, without a pattern, the assumption is supported. That is the case here.
plot(my_model2, which = 2)
The QQ-plot explores the assumption that the errors of the model are normally distributed. The points fall reasonably well on the line, therefore the normality assumption is a reasonable fit.
plot(my_model2, which = 5)
Leverage measures how much impact a single observation has on the model. Plotting residuals versus leverage helps to identify whether single observations have an unusually high impact on the model. In our example no single observation stands out.
In this exercise I am analysing a data set about the drinking behaviour of Portuguese students. At first two data sets were combined in order to analyse them (data wrangling). Later, certain variables and their relationships with the students' drinking behaviour are analysed. The drinking behaviour was converted into a discrete variable; for a discrete target variable, logistic regression is a suitable method of analysis.
alc <-read.csv("/Users/eva-mariaroth/Documents/GitHub/IODS-project/alcjoint", header=TRUE, sep=",")
The data set consists of 382 observations of 35 variables; see below the printed names of the variables. For this data set, two data sets were combined (source: UCI Machine Learning Repository). The data was collected in two Portuguese secondary schools and comprises various information about the students (school, sex, age, address), their family background, their attitude towards studying and education, how they spend their free time, their alcohol consumption and their grades (G1, G2, G3). Two questions, concerning workday alcohol consumption (Dalc) and weekend alcohol consumption (Walc), were combined during the data wrangling into the overall alcohol consumption (alc_use). Another variable for high alcohol consumption (high_use) was created; it is TRUE for all observations with a consumption higher than 2, on a scale from 1 (very low consumption) to 5 (very high consumption).
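The wrangling step described above can be sketched like this (a minimal toy data frame standing in for the joined data; only the column names match the real data set):

```r
# Toy sketch of the wrangling step: average workday (Dalc) and weekend (Walc)
# consumption, then flag averages above 2 as high use
demo <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))
demo$alc_use  <- (demo$Dalc + demo$Walc) / 2   # overall consumption
demo$high_use <- demo$alc_use > 2              # TRUE for heavy drinkers
demo
```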
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Quality of family relationships (“famrel”, numeric: from 1 very bad to 5 excellent)
Hypothesis: Students with problematic family relationships have a higher alcohol consumption.
Number of school absences (“absences”, numeric: from 0 to 93)
Hypothesis: Students with a high alcohol consumption skip school more often.
Number of past class failures (“failures”, numeric: n if 1<=n<3, else 4)
Hypothesis: High alcohol consumption leads to a higher number of failed classes.
Final grade (“G3”, numeric: from 0 to 20, output target)
Hypothesis: Students with higher alcohol consumption have lower final grades.
selected_variables <- subset (alc, select = c("famrel", "absences", "failures", "G3", "alc_use"))
library(tidyr)
library(ggplot2)
gather(selected_variables) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Now we want to see how the four selected variables influence the alcohol consumption. As the grades are approximately normally distributed, we can show the relationship between grades and high alcohol consumption nicely with a boxplot.
g1 <- ggplot(alc, aes(high_use, y = G3, col =sex))
g1 + geom_boxplot() + ylab("Grades") + ggtitle("Effects of high alcohol consumption on grades")
We can see that the male students with an alcohol consumption higher than two have lower grades than the ones who drink less. For the female students there seems to be no difference.
The relationship between high alcohol consumption and the selected variables can also be explored numerically with cross-tabulations. Below we examine the association between a variable and heavy drinking by comparing the means of the answers of students with a high alcohol consumption (TRUE) to those of students without a high consumption (FALSE).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (G3))
## # A tibble: 2 x 3
## high_use count mean_grade
## <lgl> <int> <dbl>
## 1 FALSE 268 11.73507
## 2 TRUE 114 10.80702
tab. 1: Effect of high alcohol consumption on the final grades. We can see from the means that students with a high alcohol consumption have lower average final grades, with a difference of 0.9 compared to the ones with low alcohol consumption. This supports our hypothesis.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (famrel))
## # A tibble: 2 x 3
## high_use count mean_grade
## <lgl> <int> <dbl>
## 1 FALSE 268 4.003731
## 2 TRUE 114 3.780702
tab. 2: Relationship between the quality of family relationships and high alcohol consumption. The quality of family relationships ranges from 1 (very bad) to 5 (excellent). As we can see from the means, the family relationships of people with a higher alcohol consumption tend to be slightly worse (3.8) compared to those of people who drink less (4.0). This supports our hypothesis.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (absences))
## # A tibble: 2 x 3
## high_use count mean_grade
## <lgl> <int> <dbl>
## 1 FALSE 268 3.705224
## 2 TRUE 114 6.368421
tab. 3: Relationship between drinking behaviour and number of absences. Students with a high alcohol consumption have been absent nearly twice (1.7 times) as often as students without a high consumption. This supports our hypothesis.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (failures))
## # A tibble: 2 x 3
## high_use count mean_grade
## <lgl> <int> <dbl>
## 1 FALSE 268 0.1417910
## 2 TRUE 114 0.3421053
tab. 4: Relationship between drinking behaviour and number of failed courses. The average number of failed classes is higher when students have a high alcohol consumption. This supports our initial hypothesis.
After visualising the data we know that the hypotheses for our four selected variables are supported by the data. But can we also show that statistically? With logistic regression we will now explore the relationship between the variables and the drinking behaviour statistically.
m<-glm(high_use ~ famrel + absences + failures + G3, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + absences + failures + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2195 -0.7992 -0.6770 1.1782 1.9915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.02926 0.67814 0.043 0.965585
## famrel -0.22402 0.12525 -1.789 0.073669 .
## absences 0.08174 0.02253 3.628 0.000285 ***
## failures 0.39080 0.19830 1.971 0.048752 *
## G3 -0.04377 0.03794 -1.154 0.248623
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 435.52 on 377 degrees of freedom
## AIC: 445.52
##
## Number of Fisher Scoring iterations: 4
Among the four selected variables, the number of absences is most strongly associated with high alcohol consumption. A p-value < 0.05 indicates statistical significance, so the number of failed courses also has a statistically significant association with high alcohol consumption. The quality of family relationships is inversely related to alcohol consumption, which means the better the relationships, the lower the alcohol consumption; however, this association is not statistically significant. Furthermore, according to the model there is no significant association between the final grades and the drinking behaviour. Only two of our four hypotheses are supported statistically.
The odds ratio quantifies the relationship between X and Y. In this case Y is high alcohol use and X are the explanatory variables used in our model: quality of family relationships, absences, failures and final grades.
OR <- coef(m) %>% exp()
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
coef(m)
## (Intercept) famrel absences failures G3
## 0.02925904 -0.22402430 0.08174358 0.39080167 -0.04376795
Odds ratios higher than 1 mean that X is positively related with the outcome. The output above, however, shows the raw coefficients rather than the exponentiated ones, so the odds ratios themselves were never printed.
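To actually see the odds ratios, the coefficients have to be exponentiated and printed together with the confidence intervals (for the real model object `m` above, `cbind(OR, CI)` would do it). A minimal self-contained sketch on synthetic data:

```r
# Sketch on synthetic data: odds ratios are the exponentiated coefficients
set.seed(42)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + 0.8 * x))   # true odds ratio for x: exp(0.8)
m_demo <- glm(y ~ x, family = "binomial")
OR_demo <- exp(coef(m_demo))                  # odds ratios
CI_demo <- exp(confint.default(m_demo))       # Wald confidence intervals
cbind(OR = OR_demo, CI_demo)
```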
Now we will select only the two variables with a significant relationship to alcohol consumption and fit the model again. With the logistic model, a prediction of the probability of high alcohol consumption can be made: if the probability is higher than 50%, high consumption is predicted (= TRUE). The predictions are added to the data set as a new variable, as can be seen in the data below, which shows the first ten observations of each variable.
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
head(alc, n=10)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## 7 GP M 16 U LE3 T 2 2 other other
## 8 GP F 17 U GT3 A 4 4 other teacher
## 9 GP M 15 U LE3 A 3 2 services other
## 10 GP M 15 U GT3 T 3 4 other other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 2
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## 7 home yes yes mother 1 2 0
## 8 home yes no mother 2 2 0
## 9 home yes yes mother 1 2 0
## 10 home yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## 7 no no no no yes no 4 4 4
## 8 yes yes no no yes no 4 1 4
## 9 no yes yes no yes no 4 2 2
## 10 no yes yes yes yes no 5 5 1
## Dalc Walc health absences G1 G2 G3 alc_use high_use probability
## 1 1 1 3 5 2 8 8 1.0 FALSE 0.2792474
## 2 1 1 3 3 7 8 8 1.0 FALSE 0.2459680
## 3 2 3 3 8 10 10 11 2.5 TRUE 0.5767118
## 4 1 1 5 1 14 14 14 1.0 FALSE 0.2154691
## 5 1 2 5 2 8 12 12 1.5 FALSE 0.2303651
## 6 1 2 5 8 14 14 14 1.5 FALSE 0.3340009
## 7 1 1 3 0 12 12 12 1.0 FALSE 0.2012844
## 8 1 1 1 4 8 9 10 1.0 FALSE 0.2622677
## 9 1 1 1 0 16 17 18 1.0 FALSE 0.2012844
## 10 1 1 5 0 13 14 14 1.0 FALSE 0.2012844
## prediction
## 1 FALSE
## 2 FALSE
## 3 TRUE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## 8 FALSE
## 9 FALSE
## 10 FALSE
Now we will compare the prediction to our target variable via cross-tabulation.
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 99 15
The cross-tabulation shows that there are 10 cases that are FALSE in the original data but predicted as TRUE by the model, and 99 cases where the actual value is TRUE but the model predicted FALSE. In 273 cases the model predicted the correct answer. Hence the total proportion of inaccurately classified individuals (the training error) is 28.53%, i.e. the model predicted only 71.47% of the answers correctly. I would say this is a relatively low accuracy, and the model would need to be adjusted, for example by including more variables that have a strong relationship with the drinking behaviour. Finally, we can also see the data visualised, though I find the cross-tabulation more comprehensible.
loss_func <- function(class, prob) {
  # proportion of observations where the predicted probability falls on the
  # wrong side of 0.5, i.e. the mean prediction error
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403
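To make the loss function concrete, a toy check with hypothetical labels and probabilities:

```r
# One of the four predictions is off by more than 0.5, so the loss is 0.25
loss_demo <- function(class, prob) mean(abs(class - prob) > 0.5)
loss_demo(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.1, 0.3))  # 0.25
```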
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
There are many data sets already available in R or included in package installations. In this exercise we will practise our analysis on a data set called “Boston” from the “MASS” package. The data set concerns housing values in the suburbs of Boston. It has 14 variables with 506 observations, covering living standards such as accessibility and number of rooms.
The description of the variables are:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
All the variables are numeric except for chas and rad, which are integers.
Now, let's have a look at the summary of the distributions of the variables. Here we can see, for example, that the mean crime rate is 3.61, whereas the median is 0.26 and the values range between 0.01 and 88.98. The mean pupil-teacher ratio is 18.46, which I find surprisingly low.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
With the following bar plots we can get a graphical overview of the distribution of the data within the variables.
library(ggplot2)
library(tidyr)
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Using the pairs function we can see how the variables are distributed and derive first assumptions about their relationships to each other. However, there are too many variables to get an overview at a glance.
pairs(Boston)
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits=2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
A better option for investigating the relationships between the variables is the correlation plot. The colour indicates the kind of correlation:
red - negative correlation
blue - positive correlation
The size of the circle and the intensity of the colour indicate the magnitude of the correlation coefficient:
strong colours - high correlation coefficient
light colours - low correlation coefficient
The strongest correlation can be found between the index of accessibility to radial highways and the full-value property-tax rate. The weighted mean of distances to five Boston employment centres correlates strongly with age, the nitrogen oxides concentration and the proportion of non-retail business acres per town.
Next we standardize the data set and create a factor variable for the crime rate. As the data contains only numerical values, we can use the scale() function to standardize the whole data set. Furthermore, we divide the data set into a train and a test set with 80% and 20% of the observations respectively, selected at random.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
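As a minimal illustration of what scale() does: it subtracts the column mean and divides by the standard deviation (toy vector, not the Boston data):

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)  # TRUE
```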
The variable crim describes the per capita crime rate by town.
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
head(boston_scaled)
## zn indus chas nox rm age
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909
## medv crime
## 1 0.1595278 low
## 2 -0.1014239 low
## 3 1.3229375 low
## 4 1.1815886 low
## 5 1.4860323 low
## 6 0.6705582 low
n <- nrow(boston_scaled)
ind <- sample(n, size = n*0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
dim(test)
## [1] 102 14
dim(train)
## [1] 404 14
Fitting a linear discriminant analysis on the training set. Crime is the target variable and all the other variables are predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2524752 0.2450495 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 0.86483259 -0.9023069 -0.120902136 -0.8570283 0.43956568
## med_low -0.08074812 -0.3311175 -0.002135914 -0.5804799 -0.12514426
## med_high -0.38156320 0.2460322 0.204895203 0.4187792 0.08762615
## high -0.48724019 1.0171737 -0.033716932 1.0406471 -0.35647105
## age dis rad tax ptratio
## low -0.8211460 0.8453500 -0.6936501 -0.7705826 -0.34676434
## med_low -0.3563129 0.3790438 -0.5483812 -0.5020242 -0.05916205
## med_high 0.4465741 -0.3951306 -0.4366394 -0.3040111 -0.30699364
## high 0.7994240 -0.8443661 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.3798850 -0.76313608 0.505895177
## med_low 0.3173297 -0.13465682 0.003468435
## med_high 0.0912241 0.02573923 0.161944008
## high -0.6952551 0.83945692 -0.688784912
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06089238 0.59913318 -1.0196710
## indus 0.09612264 -0.31872537 0.2542926
## chas -0.07537586 -0.07894164 0.1521931
## nox 0.34311659 -0.77432542 -1.3343262
## rm -0.09902015 -0.10571859 -0.2413598
## age 0.17213446 -0.29763305 -0.2115453
## dis -0.05464855 -0.26691832 0.1308873
## rad 3.34787750 1.07413986 -0.1318456
## tax 0.04436256 -0.12169707 0.6373323
## ptratio 0.09966469 0.06363956 -0.3658725
## black -0.13526602 0.09404649 0.1351621
## lstat 0.23803183 -0.29201064 0.4068448
## medv 0.20922800 -0.41113180 -0.1545870
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9515 0.0363 0.0123
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 7 1 0
## med_low 3 17 4 0
## med_high 0 10 15 2
## high 0 0 0 28
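From such a confusion table the overall prediction accuracy can be read off as the share of counts on the diagonal. A small sketch below hard-codes the counts shown above; the exact numbers vary with the random train/test split.

```r
# Confusion matrix counts copied from the table above (rows = correct, cols = predicted)
conf <- matrix(c(15,  7,  1,  0,
                  3, 17,  4,  0,
                  0, 10, 15,  2,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Correct predictions lie on the diagonal: 75 out of 102 test observations
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.735
```

Note how most confusion happens between neighbouring classes, while the "high" class is predicted perfectly.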
#reloading
data(Boston)
#scale dataset and make data frame
boston_scaled1 <- as.data.frame(scale(Boston))
For calculating the distances between the observations we use the most common measure, the Euclidean distance.
dist_eu <- dist(boston_scaled1)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
Now we apply K-means to the dataset, starting with 4 cluster centers.
km <- kmeans(boston_scaled1, centers = 4)
However, we want to determine the optimal number of clusters.
set.seed(123)
#set the max number of clusters to be ten
k_max<- 10
#calculate total WCSS
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled1, k)$tot.withinss})
#visualizing WCSS
qplot(x = 1:k_max, y = twcss, geom = 'line')
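The "elbow" of this curve, i.e. the point where the total WCSS stops dropping sharply, suggests the optimal number of clusters. A minimal sketch of the same idea on a hypothetical toy dataset with two obviously separated clusters and a fixed seed:

```r
set.seed(1)
# Two well-separated toy clusters of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..5
twcss_toy <- sapply(1:5, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)

# The largest drop in WCSS happens from k = 1 to k = 2, so the elbow is at 2
which.max(-diff(twcss_toy)) + 1  # 2
```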
The WCSS curve drops most sharply at two clusters, so we re-run K-means with two centers.
km <- kmeans(boston_scaled1, centers = 2)
#visualize Kmeans
pairs(boston_scaled1, col = km$cluster)
pairs(boston_scaled1[6:10], col = km$cluster)
First we scale the Boston dataset again, comparing the summaries before and after the scaling.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
boston_scaled2 <- scale(Boston)
summary(boston_scaled2)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
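The summary confirms that scale() subtracts each column's mean and divides by its standard deviation, so every variable ends up with mean 0 and standard deviation 1. A quick check on a toy vector:

```r
x <- c(2, 4, 6, 8)
z <- as.numeric(scale(x))

# After scaling the mean is 0 and the standard deviation is 1
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1

# scale() is equivalent to the manual standardization
all.equal(z, (x - mean(x)) / sd(x))  # TRUE
```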
Now we perform the K-means on the standardized dataset.
km2 <- kmeans(boston_scaled2, centers = 3)
boston_scaled2 <- as.data.frame(boston_scaled2)
Finally we perform a Linear Discriminant Analysis, using the clusters as target classes and including all the variables of the Boston data in the LDA model.
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled2)
Visualizing the results, we can observe…
# function for the lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# setting target classes as numeric
classes <- as.numeric(km2$cluster)
# plotting the lda results
plot(lda.fit2, dimen = 2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale = 1)
We run the code below on the scaled train data that we used to fit the LDA. It creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
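Dimension-wise, the 404 × 13 predictor matrix times the 13 × 3 scaling matrix yields a 404 × 3 matrix of discriminant scores, one column per linear discriminant. A minimal sketch of the same operation with made-up dimensions:

```r
set.seed(42)
# Hypothetical stand-ins: 5 observations, 3 predictors, 2 discriminants
predictors <- matrix(rnorm(15), nrow = 5, ncol = 3)
scaling    <- matrix(rnorm(6),  nrow = 3, ncol = 2)

# %*% is R's matrix multiplication operator
scores <- predictors %*% scaling
dim(scores)  # 5 2
```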
Now we install and access the plotly package in order to create a 3D plot from our data.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
We draw the same plot again, but set the crime classes to be the colours.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Drawing the same plot again with the colours matching the clusters of the K-means.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$cl)